Can Regression Algorithms Predict Wine Quality as Skillfully as Classification Algorithms?

Author: Joe Cerniglia
Date: April 4, 2022

In 2009, five researchers from Portugal published a paper reporting their results from using machine learning algorithms in R to model the preferences of professional wine tasters for wines from the vinho verde region. Because of the differences between red and white wines, the authors split their data into separate datasets, which may be found at the UCI Machine Learning Repository. The intent of the study was to use machine learning to assist in the certification process of professional tasters and to help stratify premium wines for price-setting purposes.

Although the problem is one of classification, the authors found that support vector regression (SVR), a regression approach, outperformed the neural network classifier. Thus, they chose SVM as their favored model for vinho verde wine modeling.

In selecting my first machine learning project, I wanted a topic that was not inherently controversial, not subject to privacy or data protection concerns, had popular appeal, and most importantly, would encourage experimentation and courage in spite of possible error on my part.

My project has two objectives:

  1. To the extent possible, attempt to duplicate the regression result of the study authors, using Python machine learning libraries.
  2. Using classification methods such as decision trees, determine whether the precision and recall achieved by the original study may be improved.

The code in this notebook may be used to evaluate either white wine or red wine. For this study, I have chosen to focus on white wine because it is the more robust of the two datasets, and because, as the paper authors relate, white vinho verde wine is the more predominant Portuguese export. Also, because the notebook takes several minutes to run to completion, it seemed appropriate to choose a single wine varietal for the models to evaluate, rather than require that they evaluate them both.

Note also that because the models are stochastic in nature, results will vary by small amounts each time the notebook is run.

The first step is to import relevant Python libraries.

Include logic for Regression Error Characteristic Curves (REC curves) by Amirhessam Tahmassebi.

REC curves are one of the primary metrics the study authors use to compare machine learning algorithms. As such, any attempt to duplicate the authors' result must have them. Note I have modified Tahmassebi's logic slightly by simplifying the formula to calculate the curve, and I have added in flexibility regarding the limit used to plot the x-axis and the ability to add a title to the graph.
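To make the idea concrete, here is a minimal sketch of an REC curve computation of my own (it is not Tahmassebi's code): sweep an error tolerance along the x-axis and record the fraction of predictions whose absolute error falls within that tolerance.

```python
import numpy as np

def rec_curve(y_true, y_pred, x_max=2.0, steps=101):
    """Regression Error Characteristic curve: for each error tolerance
    epsilon on the x-axis, the fraction of predictions whose absolute
    error is within epsilon. Returns tolerances, accuracies, and the
    AUC normalized by x_max, so a perfect model scores 1.0."""
    errors = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    tolerances = np.linspace(0.0, x_max, steps)
    accuracy = np.array([(errors <= eps).mean() for eps in tolerances])
    # Trapezoidal area under the accuracy curve, normalized to [0, 1]
    auc = ((accuracy[1:] + accuracy[:-1]) / 2 * np.diff(tolerances)).sum() / x_max
    return tolerances, accuracy, auc
```

The `x_max` parameter is the flexible x-axis limit mentioned above; a model is better the faster its accuracy curve rises toward 1.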

Here is a function to convert continuous float-formatted (decimal-formatted) predictions of a regression model to discrete classification predictions with integer values.

The study authors used similar logic. To build their REC curve graphs, they used the continuous predicted values they obtained from their regression models. To build their confusion matrices (these contain precision, recall and F1-scores), which require discrete values as input, they used 'tolerance' factors that formed the basis of the conversion that this function performs.
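A minimal version of such a conversion (my own sketch, with the 3-9 rating range of the white wine dataset assumed as the valid bounds) might look like this:

```python
import numpy as np

def to_discrete(predictions, lo=3, hi=9):
    """Round continuous regression predictions to the nearest integer
    quality class, clipping to the valid rating range of the dataset."""
    return np.clip(np.rint(predictions), lo, hi).astype(int)
```

For example, `to_discrete([5.4, 7.6, 10.2])` yields the classes 5, 8, and 9, the last one clipped back into range.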

Read in the white wine data and examine the dataset attributes.

The most noticeable aspect of the data is the imbalance in quality ratings. Very good and poor wines are rare, comprising only 3.7% of the white wine sample. Wines on either extreme of the quality spectrum are the ones that are of most interest, but they are also the hardest for machine learning algorithms to classify.

The correlations between the individual features of this dataset and quality are not strong; in most cases they are rather subtle. Alcohol correlates moderately with quality. Alcohol reduces liquid density, so those two are negatively correlated; sugar increases density, so those two are positively correlated.

The box plots show a fair amount of spread among the individual feature values. This may indicate that models will be able to distinguish and separate clusters of features in the feature space, which may assist in classifying the various gradations of wines.

Prepare data by separating the training and validation (test) datasets.

An 80-20 train-test split was used.
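With scikit-learn, the split can be sketched as follows (the feature matrix here is synthetic stand-in data rather than the wine dataset, and the random seed is arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 11))      # stand-in for the 11 physicochemical features
y = rng.integers(3, 10, size=100)   # stand-in quality ratings, 3-9

# Hold out 20% of the rows as a validation (test) set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=7)
```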

Do an analysis of relative feature importance.

Knowledge of feature importance can be helpful in deciding whether to use all or some of the features in the model itself. The ability to elide certain features can improve the efficiency of the program and perhaps even the classification skill of the model.

While alcohol stands out among the rest of the features, probably due to its correlation with quality, there are no strong groupings of useful versus not-useful features. It is not immediately clear which features, if any, would be irrelevant to a prospective machine learning model. (I note that these rankings are different from those arrived at by the study authors.)

Use RFE (Recursive Feature Elimination) to help determine whether some features can be sacrificed.

The decision of how many features should be selected is left to the programmer. I chose eight (8).
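A sketch of RFE with eight features retained, again on synthetic stand-in data (the estimator choice of a CART-style decision tree mirrors the models used elsewhere in this notebook, but is an assumption here):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 11))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=200)  # synthetic target

# Recursively drop the weakest-ranked feature until eight remain
rfe = RFE(estimator=DecisionTreeRegressor(random_state=1),
          n_features_to_select=8)
rfe.fit(X, y)
```

After fitting, `rfe.support_` flags the eight selected features and `rfe.ranking_` orders the rejected ones.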

RFE appears to rank the features similarly to the way in which the bar graph above does.

Use cross_val_score to evaluate the CART (Classification and Regression Tree) model on all possible numbers of features.

A consensus seems to be forming that between 8 and 11 features selected will work well, but other tools remain available to let the model decide.
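The sweep over feature counts can be sketched like this, with SelectKBest standing in for the selection step inside a pipeline (synthetic data; the scoring metric is an assumption):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 11))
y = X[:, 0] + rng.normal(scale=0.5, size=150)

# Mean cross-validated error for each candidate number of features
scores = {}
for k in range(1, X.shape[1] + 1):
    pipe = Pipeline([
        ("select", SelectKBest(f_regression, k=k)),
        ("cart", DecisionTreeRegressor(random_state=1)),
    ])
    scores[k] = cross_val_score(
        pipe, X, y, cv=5, scoring="neg_mean_absolute_error").mean()
```

Plotting `scores` against `k` reveals the plateau from which the 8-to-11 range emerges.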

Obtain a baseline metric of which models perform best.

Plot the results of model performance on a box plot chart.

Linear Regression outperformed SVR, which is the reverse of what the study authors concluded. However, the data in this model has neither been scaled nor tuned. We can refine these performance results by scaling and tuning the data prior to running it through various models. Once this has been done, a different result, closer to what the study authors found, might be obtained.

Refine the model comparison by introducing the Standard Scaler into a pipeline for each of the same models. Produce REC (Regression Error Characteristic) curves to compare the models more precisely. The higher the Area Under Curve (AUC) score is, the better the model performs.

Note that we have not yet applied the function to transform the predicted values into discrete categorical values. Therefore, regression models SVR (Support Vector Regression) and LR (Linear Regression) show up in the REC curves as curves, rather than in the shape of a stairway. One might easily change these settings later on, but retaining the curves for the moment for these regression models allows a better comparison with Fig.3 from the study authors' paper.

The box plots above show that the SVR outperforms LR, just as the study authors showed in 2009.

Before studying classification models in greater detail, however, let's fine-tune the SVR model the best that we can, to be sure that we are giving it a fair chance to demonstrate what it can do.

First, let's rerun the SVR model above. This time, we will apply the function to transform the predicted values into discrete categorical values, which we will scale. The discrete values will allow the use of the eval_final function, which provides a more detailed report on performance of the SVR model, including a confusion matrix, which can only be provided from categorical input. We will not worry yet about cross validation or including the scale into a pipeline, since we only want a baseline performance result.

Now that we can see this expanded report that the eval_final() function provides, it becomes clear that no single metric can tell the full story of the performance of a model. Area under the REC curve can give us a baseline of information, but there are equally important metrics the above report provides. All the metrics provide information, but the most contextually important, in my opinion, are weighted recall, Kappa, Mean Absolute Deviation (MAD), and, critical for fine-tuning, the recall of quality level-4 (a poorer wine) and the recall of quality level-8 (an excellent wine). These last two may be the most important of all, since they distinguish a model by its ability to classify what is known as 'the positive class,' in other words, the class of values (poor, excellent) in which we are most interested. Any model should be able to perform reasonably well on the middle classes of wines, since they are the vast majority of the wines in the database and pure chance would favor the model as having 'guessed' correctly. The extremes of the wine spectrum are the most challenging, and therefore reveal the 'true skill' of any model.

We may begin a fair comparison by assembling the results of the SVR assessment into a Pandas dataframe. For good measure, we should include these metrics from the original 2009 paper as well.

So far, we have been able to approximate the results of the 2009 study in the SVR model, but since the original code from the 2009 study is not available, we will not be able to duplicate the 2009 result exactly. Nevertheless, the Pandas dataframe provides a baseline of results, which we can supplement as we test additional models.

Tuning the hyperparameters of SVR

When we run the SVR model as if it were a categorical algorithm, we find that its performance in its REC curve appears to be just as good as for the CART model. Can the SVR model be improved even further by tuning its hyperparameters? The choice of the range of parameters on which to search was partly dictated by considerations of processing time. Searches of too wide a range of values will take much longer to complete, and the gains in performance would be marginal.
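A hyperparameter search along these lines can be sketched with GridSearchCV over a deliberately narrow grid (synthetic stand-in data; the particular C, gamma, and epsilon candidates are illustrative assumptions, not the values actually searched):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 11))
y = X[:, 0] + rng.normal(scale=0.3, size=120)

pipe = Pipeline([("scale", StandardScaler()), ("svr", SVR())])
param_grid = {                      # kept narrow to bound processing time
    "svr__C": [0.1, 1, 10],
    "svr__gamma": ["scale", 0.1],
    "svr__epsilon": [0.05, 0.1],
}
search = GridSearchCV(pipe, param_grid, cv=3,
                      scoring="neg_mean_absolute_error")
search.fit(X, y)
```

`search.best_params_` then supplies the tuned values to plug into subsequent runs.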

Check optimal number of features again.

As an interim step, we want to ensure that the optimal number of features to evaluate has remained constant even with tuning the SVR model. Plug the best hyperparameter values obtained above into the SVR model. Evaluate with a repeated stratified kfold, thus ensuring a statistically valid mean comparison value. Use the SelectKBest function for an automatic selection of an optimal number of features to select.

We find from the box plots above that the optimal number of features has remained steady, despite some variation in the lower ranges. Since we want the optimal number, the fact that three (3) features apparently score better than five (5) features is of little consequence. We will be selecting seven (7) or more features.

Now we can submit a fine-tuned SVR model that we may then compare with the results of the CART Regressor.

The following code combines Principal Component Analysis with SelectKBest to supply an optimal number of features to select. We also scale the training set using the standard scaler. The scaling is performed within a pipeline on each kfold. This is done to prevent data leakage, an undue bias that would otherwise be introduced by scaling the entire training dataset prior to fitting of the model. Finally, the tuned SVR model is fit to each kfold, using the optimal hyperparameters identified in the prior step.
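A sketch of that arrangement, using a FeatureUnion to combine PCA with SelectKBest inside the pipeline (synthetic stand-in data; the component counts and SVR hyperparameters shown are placeholder assumptions for the tuned values):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 11))
y = X[:, 0] + rng.normal(scale=0.3, size=150)

features = FeatureUnion([
    ("pca", PCA(n_components=3)),
    ("kbest", SelectKBest(f_regression, k=7)),
])
model = Pipeline([
    ("scale", StandardScaler()),   # fit inside each fold, preventing leakage
    ("features", features),
    ("svr", SVR(C=10, gamma=0.1, epsilon=0.1)),  # stand-in tuned values
])
scores = cross_val_score(model, X, y, cv=5,
                         scoring="neg_mean_absolute_error")
```

Because the scaler sits inside the pipeline, each cross-validation fold scales only its own training portion, which is the leakage-prevention point made above.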

Tuning and scaling the SVR model within a pipeline yields an AUC (area under the REC curve) of 0.799, higher than the CART Regressor and the previous untuned SVR model. We can add this result to our collection.

We find tuning and scaling the SVR model has caused an improvement in several of the metrics but this has occurred at the expense of recall-4 and recall-8, a common phenomenon. The SVR from the original 2009 study still dominates our beginner's attempts at classification.

Take a look at K-Nearest Neighbors.

KNN ranked second in our initial box-plot comparison of model skill. It should not be overlooked. Perhaps tuning the KNN model can yield satisfactory results.

KNN has not scored better than our previous attempts, even after tuning and scaling. It may be that KNN is poorly suited for exposing the structure of this dataset. Nevertheless, let's add KNN to our collection.

Test a number of different classification models together.

If we are to exceed the model results of the original 2009 study, we will need to do better. A good first step is to compare a number of different classification models simultaneously, this time with scaling, to see which hold the most promise.

Although this procedure did not provide detailed reports for each model, the box plot above reveals that decision trees (Bagged Tree, Random Forest, and Extra Trees) appear to hold a decided advantage in MSE (mean squared error). They appear to outperform all of the other classification models. This is something that needs to be investigated further. It indicates, perhaps, that better machine learning algorithms may be available for the work of classifying these wines of the vinho verde region of Portugal. We will study decision trees more closely, to discover whether a better model can be obtained than the one from the 2009 study. But first...

Make a report on SVC class-weighted.

This model seemed to perform poorly, but it would be interesting to add this to our collection, for the sake of diversity.

Even though its weighted recall fell below that of the other models, the decline seemed to be to the benefit of recall-4 and recall-8, which are the best seen so far. It will be interesting to see whether decision trees can improve on recall-4 and recall-8 while also not sacrificing the accuracy of classifying the middle classes.

Next, let's take a look at scaled CART with best features, a baseline decision tree.

Adding CART to the collection

CART shows itself to be much the equal of SVR from the 2009 study, with a higher Kappa, an equivalent Mean Absolute Deviation, and recall-4 and recall-8 values that are acceptable. Could other decision trees improve upon this result even more?

Experiment with Balanced Bagged Decision Trees.

Balanced bagged decision trees introduce undersampling of the majority classes in an attempt to improve resolution of the minority. This is but one method of improving the classification skill of imbalanced datasets.

The balanced bagging result has the best recall-4 of any model thus far, but this improvement has come at the expense of nearly every other metric. Still, we can add this result to the collection.


Let's try bagged decision trees without the refinement of balancing with undersampling of the majority.

Adding Bagged Decision Trees to the collection:

This is by far the best result yet, improving on the 2009 study and prior models attempted. The recall-4 leaves a little to be desired, but perhaps additional models can provide even further enhancement.

Random Forests enjoy a good reputation among decision trees. Perhaps good results can be obtained from their use:

Adding Random Forests to the collection:

Again, we find that our decision trees are yielding excellent results, surpassing recall-4 and recall-8 of the original study and maintaining equal or better classification accuracy for the midrange.

Improvements on Random Forests will be incremental, but perhaps even better classification accuracy of the extreme categories of quality will be possible with oversampling techniques.

To begin, let's try using the random oversampling method on the Extra Trees classifier. We will include a graph of how the quality ratings are distributed by pH (x-axis) and by alcohol percent (y-axis). Note that the graphs representing before and after random oversampling will not appear any different: random oversampling does not create new examples but simply duplicates existing minority-class examples, drawing from them at random. This is known as sampling "with replacement."

Here are the results for random oversampling with Extra Trees.

Adding random oversampling with Extra Trees to the collection:

Although the gains are modest, the Extra Trees model with random oversampling achieved improvements in nearly every metric. Notably, the recall-8 increased to 42.9%.

Other oversampling methods exist. BorderlineSMOTE (Borderline Synthetic Minority Oversampling Technique) is an algorithm that adds synthetic minority-class examples near the class boundary, which can improve the resolving power of the classification.

Adding Extra Trees with Borderline Smote to the collection:

Although the weighted overall recall fell, the recall-4 and recall-8 both increased. This result currently leads the rest of the pack.

SMOTE is a close relative of Borderline Smote. We will now compare its performance on the Extra Trees model.

Adding Extra Trees with SMOTE to the collection:

This Extra Trees result raised the recall-4 significantly. While the recall-8 decreased slightly, the rest of the metrics stayed much the same. We have now achieved a weighted recall incrementally better than that of the 2009 study, but noticeably better for recall of low-scoring and high-scoring wines. The Kappa is ahead of the 2009 study, and the Mean Absolute Deviation is lower (good), indicating that the misclassified wines in the confusion matrix are grouped more tightly around their actual values than they were in the 2009 model. The number of 'near misses' has increased while the number of 'complete misses' has fallen.

Apply additional models experimentally.

There are many models we could try to improve upon the Extra Trees result, but we could never have enough space or time to apply all of them. Still, it is worth attempting a few to see whether slight improvements can be made to the classification of the positive classes.

Mahalanobis Distance Oversampling with Extra Trees

Note that as of this writing MDO cannot be integrated into a pipeline with ensemble methods. Extra Trees is considered an ensemble method. Therefore, for the sake of experimentation, we must oversample the data on the entire training dataset, and tolerate the data leakage that results.

Acknowledgement: The code for this method comes from the bachelor's thesis titled 'Multi-imbalance: Python package for multi-class imbalance learning', by Jacek Grycza, Damian Horna, Hanna Klimczak, Kamil Pluciński, Poznan University of Technology, Poznan, Poland, 2020.

Instructions for its installation may be found here: https://github.com/damian-horna/multi-imbalance/blob/master/README.md

Adding Extra Trees with Mahalanobis Distance Oversampling (MDO) to the collection:

The MDO with Extra Trees overall performs well, but at the expense of a loss of accuracy in recall-4. Extra Trees with scaling and SMOTE still appears to be the winner, the model most likely to succeed in both imbalanced learning and in overall classification accuracy.

Mahalanobis Distance Oversampling with Random Forests

It seems worth checking on whether a different ensemble method could yield better results from imbalanced learning when associated with MDO. We will try pairing MDO with Random Forests.

Adding Random Forests with Mahalanobis Distance Oversampling (MDO) to the collection:

Mahalanobis Distance Oversampling with Random Forests is not as skillful as Extra Trees at classifying the imbalanced portion of the data. Extra Trees emerged early on from the box-plot comparisons as a potential favorite as compared with other models. It appears to be meeting those early expectations.

SMOTE Edited Nearest Neighbors (SMOTEENN) with random oversampling is a method of combining oversampling with undersampling.

This application undersamples only the majority classes, removing only those examples that are misclassified. The dataset as a whole is, at the same time, oversampled, both with the random oversampling method and the SMOTE method. Sometimes this combination of oversampling and undersampling can yield good results, but it is an experiment, and we will need to see. We will align these techniques with the Extra Trees model.

Adding Extra Trees with SMOTEENN with Oversampling to the collection:

The undersampling-oversampling strategy was skillful, but its imbalanced learning did not exceed that of Extra Trees with SMOTE.

Adaptive Synthetic Oversampling (ADASYN) is a variation of SMOTE that oversamples minority classes based upon their relative density.

This slight variation of SMOTE can sometimes yield decision boundaries that are easier for a model to recognize.

Adding ADASYN with Extra Trees to the collection:

ADASYN with Extra Trees achieved a competitive weighted recall and Kappa, along with a quite respectable Mean Absolute Deviation (MAD). Recall-4 is high at 36.4%, as is recall-8 at 42.9%. Overall, this model achieves an excellent blend of skillful imbalanced learning with overall accuracy and a low Mean Absolute Deviation.

AdaBoost Classification is a method of sampling in such a way as to bias the model toward misclassified examples.
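A minimal AdaBoost sketch on synthetic two-class data (the class values are stand-ins for wine ratings):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 11))
y = np.where(X[:, 0] > 0, 7, 6)             # two synthetic quality classes

# Each boosting round reweights the training set toward the examples
# the previous round misclassified
clf = AdaBoostClassifier(n_estimators=50, random_state=6)
clf.fit(X, y)
```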

Adding AdaBoost to the collection:

AdaBoost performed worse than any other model thus far tried. Its result, according to the Kappa score, is very close to that which could have been achieved by guessing. With more experience, I may come to learn whether some combination of sampling, modification of hyperparameters, or feature selection might improve the skill of this model.

Stochastic Gradient Boosting selects small sub-samples from the data and builds a decision tree from each one. The sub-sample approach is intended to counteract over-fitting.
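In scikit-learn, the stochastic variant is obtained by setting `subsample` below 1.0 on GradientBoostingClassifier; a sketch on synthetic data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 11))
y = np.where(X[:, 0] + X[:, 1] > 0, 7, 6)   # synthetic two-class target

# subsample < 1.0 makes the boosting stochastic: each tree sees only a
# random 70% of the training rows, which helps counteract over-fitting
clf = GradientBoostingClassifier(n_estimators=100, subsample=0.7,
                                 random_state=7)
clf.fit(X, y)
```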

Adding Stochastic Gradient Boosting to the collection:

Stochastic Gradient Boosting is fairly competitive with other models, even as compared with the 2009 study, but it does not achieve the levels of performance obtained with Extra Trees.

Combining calibration of probabilities with Extra Trees can harness the ability of a model to reflect true probability rather than a probability of prediction that may be biased towards the majority classes.
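One way to sketch this pairing is with CalibratedClassifierCV wrapping an Extra Trees model (synthetic imbalanced data; the isotonic method and fold count are assumptions):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(8)
X = rng.normal(size=(300, 11))
y = np.where(X[:, 0] > 0.5, 8, 6)           # imbalanced synthetic labels

# Re-maps the ensemble's raw scores so predicted probabilities better
# match observed class frequencies
cal = CalibratedClassifierCV(
    ExtraTreesClassifier(n_estimators=50, random_state=8),
    method="isotonic", cv=3)
cal.fit(X, y)
proba = cal.predict_proba(X)
```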

Adding Extra Trees with Calibrated Probabilities to the collection:

Calibrating the probabilities created a model that is highly competitive.

Experiment with Binary Classification

Many machine learning datasets have a simple division of the positive and the negative classes, in contrast to the wine dataset, which has four positive classes (3, 4, 8, 9) and three negative classes (5, 6, 7). As an experiment, I would like to find out what would happen if the wine dataset were to be re-categorized as a binary, two-class dataset, positive or negative, rather than the multiple categories of data we have been using.

First, we will re-classify the Y training dataset by separating out the excellent wines (categories 8 or 9) from the rest. Then, we will try out an assortment of different models to see whether any are notably successful in classification with the new binary data arrangement.
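The relabeling itself is a one-liner; a sketch with hypothetical sample ratings:

```python
import numpy as np

y = np.array([5, 6, 8, 9, 4, 7, 8])     # sample quality ratings
y_binary = (y >= 8).astype(int)         # 1 = excellent (rated 8 or 9)
print(y_binary)                         # [0 0 1 1 0 0 1]
```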

The imbalance is still noticeable when the positive (excellent) wines are separated from all others. As previously, Extra Trees is found to perform more skillfully than the other models. We can supplement Extra Trees with oversampling in the hope of enhancing the model's skill at imbalanced classification to a greater extent than was possible in a multi-categorical arrangement.

Combine the binary categorization with SMOTE and calibrated probabilities. Discover whether this will improve the skill of the model at distinguishing an excellent wine from the rest.

These results highlight the paradox of accuracy. Most of the examples are in the negative class. Most of the predictions, too, are in the negative class. This yields a high accuracy overall, but, in this example, pure chance (and a knowledge of the chance probability of either class) would also have yielded high accuracy. While the recall of the positive class is not abysmal, it is no higher than the recall exhibited in the multi-categorical models. Therefore, I would conclude that there is no intrinsic benefit in re-casting this data as a binary classification problem.

Conclusions

I would select ADASYN, Extra Trees with SMOTE, or Extra Trees with calibrated probabilities, highlighted in gold above, as my candidates to improve upon the classification work of the 2009 study of vinho verde wines from Portugal, highlighted in green. Any of these would perform well against the original 2009 SVR model.

My Support Vector Regression model, after tuning the hyperparameters and scaling the data with the standard scaler on each kfold, was able to obtain a ballpark resemblance to the accuracy of the 2009 study. However, my SVR model did not achieve the recall of the positive classes that the study authors did. This may be due to the fact that the authors' selection of key features was more technically sophisticated, and based on better domain knowledge of winemaking than I currently possess.

As my study progressed, however, I was able to demonstrate that classification models such as Extra Trees, paired with oversampling algorithms such as Adaptive Synthetic Sampling, Synthetic Minority Oversampling (SMOTE) and Mahalanobis Distance Oversampling (MDO), achieve a predictive power that exceeds that of regression models such as SVR, especially in their ability to predict the positive classes, which are the wines on the high and low end of the quality spectrum. To a professional wine taster, those wines that are excellent and those that are poor are of the greatest concern, and so they would naturally seek out the machine learning algorithm with the most success at classifying these categories.

It may well be that improvements in the availability and accessibility of machine learning models since 2009, as exemplified by the sklearn and imblearn Python libraries, provide to the researcher a more refined toolbox than was available to the study authors in 2009. This would account for the considerable improvement I was able to achieve in Kappa score, in Mean Absolute Deviation, and in recall of the wine category ratings of 4 and 8.

Acknowledgements

I learned the syntactical outline of much of this code, especially with regard to imbalanced classification problems, from Dr. Jason Brownlee: https://machinelearningmastery.com/machine-learning-with-python/

I strongly recommend Dr. Brownlee's work to beginners such as myself.

Further Reading

Abdi, Lida, and Sattar Hashemi. "To combat multi-class imbalanced problems by means of over-sampling techniques." IEEE transactions on Knowledge and Data Engineering 28.1 (2015): 238-251.

Bi, Jinbo, and Kristin P. Bennett. "Regression error characteristic curves." Proceedings of the 20th international conference on machine learning (ICML-03). 2003.

Brownlee, Jason. Imbalanced Classification with Python: Choose Better Metrics, Balance Skewed Classes, and Apply Cost-Sensitive Learning. 1.3 ed., 2021.

Brownlee, Jason. Machine Learning Mastery with Python. 1.20 ed., 2020.

Cortez, Paulo, et al. "Modeling wine preferences by data mining from physicochemical properties." Decision Support Systems 47.4 (2009): 547-553.

Han, Hui, Wen-Yuan Wang, and Bing-Huan Mao. "Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning." International conference on intelligent computing. Springer, Berlin, Heidelberg, 2005.

Hand, David J. "Classifier technology and the illusion of progress." Statistical science 21.1 (2006): 1-14.

McHugh, Mary L. "Interrater reliability: the kappa statistic." Biochemia medica 22.3 (2012): 276-282.

Weiss, Gary M. et al. “Cost-Sensitive Learning vs. Sampling: Which is Best for Handling Unbalanced Classes with Unequal Error Costs?” DMIN (2007).